Convergence Proofs of Least Squares Policy Iteration Algorithm for High-Dimensional Infinite Horizon Markov Decision Process Problems
Authors
Abstract
Most of the current theory for dynamic programming algorithms focuses on finite state, finite action Markov decision problems, with a paucity of theory for the convergence of approximation algorithms with continuous states. In this paper we propose a policy iteration algorithm for infinite-horizon Markov decision problems where the state and action spaces are continuous and the expectation cannot be computed exactly. We show that an appropriately designed least squares (LS) or recursive least squares (RLS) method is provably convergent under certain problem structure assumptions on value functions. In addition, we show that the LS/RLS approximate policy iteration algorithm converges in the mean, meaning that the mean error between the approximate policy value function and the optimal value function shrinks to zero as successive approximations become more accurate. Furthermore, the convergence results are extended to the more general case of unknown basis functions with orthogonal polynomials.
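As a rough illustration of the kind of algorithm analyzed here, the sketch below runs least-squares approximate policy iteration on a hypothetical one-dimensional problem: at each iteration it samples states, acts greedily with respect to the current approximation, and refits the value-function weights by least squares on a small polynomial basis. The dynamics, reward, basis, and sampling scheme are illustrative assumptions, not the construction used in the paper.

    import numpy as np

    gamma = 0.9
    rng = np.random.default_rng(0)
    actions = np.linspace(-1.0, 1.0, 21)      # discretized actions for the greedy step

    def basis(s):
        # small polynomial basis phi(s); the paper works with orthogonal polynomials
        return np.array([1.0, s, s**2])

    def step(s, a):
        # hypothetical continuous-state dynamics with additive noise and a quadratic reward
        s_next = 0.8 * s + a + 0.1 * rng.standard_normal()
        return s_next, -(s**2 + 0.1 * a**2)

    theta = np.zeros(3)                        # value-function weights

    for iteration in range(20):
        Phi, targets = [], []
        for _ in range(500):
            s = rng.uniform(-2.0, 2.0)
            # one-sample greedy action under the current approximation
            q = []
            for a in actions:
                s_next, r = step(s, a)
                q.append(r + gamma * basis(s_next) @ theta)
            a = actions[int(np.argmax(q))]
            s_next, r = step(s, a)
            Phi.append(basis(s))
            targets.append(r + gamma * basis(s_next) @ theta)
        # least squares fit of V(s) ~ basis(s) @ theta to the sampled Bellman targets
        theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(targets), rcond=None)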
Similar Articles
Convergence Analysis of On-Policy LSPI for Multi-Dimensional Continuous State and Action-Space MDPs and Extension with Orthogonal Polynomial Approximation
We propose an online, on-policy least-squares policy iteration (LSPI) algorithm which can be applied to infinite horizon problems where states and controls are vector-valued and continuous. We do not use special structure such as linear, additive noise, and we assume that the expectation cannot be computed exactly. We use the concept of the post-decision state variable to eliminate the exp...
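A bare sketch of the post-decision-state idea under stated assumptions: the transition is split into a deterministic map f(s, a) followed by exogenous noise, the post-decision value function is linear in a small basis, and its weights are updated by recursive least squares along the trajectory. The dynamics, reward, basis, and the helper names (f, greedy) are hypothetical and introduced only for illustration.

    import numpy as np

    gamma, lam = 0.95, 1.0                    # discount factor, RLS forgetting factor
    rng = np.random.default_rng(1)
    actions = np.linspace(-1.0, 1.0, 21)

    def f(s, a):                              # post-decision state: deterministic part of the transition
        return 0.9 * s + a

    def reward(s, a):
        return -(s**2 + 0.1 * a**2)

    def basis(s_post):
        return np.array([1.0, s_post, s_post**2])

    theta = np.zeros(3)                       # weights of the post-decision value function
    B = 100.0 * np.eye(3)                     # RLS inverse-covariance matrix

    def greedy(s):
        # no expectation is needed inside the argmax: noise enters only after f(s, a)
        vals = [reward(s, u) + gamma * basis(f(s, u)) @ theta for u in actions]
        k = int(np.argmax(vals))
        return actions[k], vals[k]

    s = 0.0
    for t in range(1000):
        a, _ = greedy(s)
        s_post = f(s, a)
        s_next = s_post + 0.1 * rng.standard_normal()   # exogenous noise realization
        _, v_hat = greedy(s_next)                        # sampled target for the value of s_post
        phi = basis(s_post)
        Bphi = B @ phi
        gain = Bphi / (lam + phi @ Bphi)
        theta = theta + gain * (v_hat - phi @ theta)     # recursive least squares update
        B = (B - np.outer(gain, Bphi)) / lam
        s = s_next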
Risk-averse dynamic programming for Markov decision processes
We introduce the concept of a Markov risk measure and we use it to formulate risk-averse control problems for two Markov decision models: a finite horizon model and a discounted infinite horizon model. For both models we derive risk-averse dynamic programming equations and a value iteration method. For the infinite horizon problem we also develop a risk-averse policy iteration method and we pro...
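For concreteness, one coherent one-step risk measure fitting this framework is mean-upper-semideviation. The toy value iteration below replaces the expectation in the Bellman operator with it on a hypothetical two-state, two-action MDP; the transition probabilities, costs, and the weight c are placeholders, not taken from the paper.

    import numpy as np

    gamma, c = 0.9, 0.5                       # discount factor, risk-aversion weight
    P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # P[a, s, s']: two states, two actions
                  [[0.5, 0.5], [0.9, 0.1]]])
    cost = np.array([[1.0, 2.0],              # cost[a, s]
                     [1.5, 0.5]])

    def risk(p, z):
        # mean-upper-semideviation of the cost vector z under distribution p
        m = p @ z
        return m + c * (p @ np.maximum(z - m, 0.0))

    v = np.zeros(2)
    for _ in range(200):
        q = np.array([[cost[a, s] + gamma * risk(P[a, s], v) for s in range(2)]
                      for a in range(2)])
        v = q.min(axis=0)                     # risk-averse Bellman operator (minimizing cost)
    print(v)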
Convergence Analysis of Kernel-based On-policy Approximate Policy Iteration Algorithms for Markov Decision Processes with Continuous, Multidimensional States and Actions
Using kernel smoothing techniques, we propose three different online, on-policy approximate policy iteration algorithms which can be applied to infinite horizon problems with continuous and vector-valued states and actions. Using Monte Carlo sampling to estimate the value function around the post-decision state, we reduce the problem to a sequence of deterministic, nonlinear programming problem...
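The kernel-smoothing idea itself is simple to state: estimate the value at a post-decision state by a Nadaraya-Watson weighted average over previously sampled (state, observed value) pairs. The small sketch below shows that estimator; the bandwidth and the synthetic data are placeholders, not the authors' choices.

    import numpy as np

    def kernel_value(s_query, states, values, h=0.3):
        # Gaussian kernel weights over stored post-decision states
        w = np.exp(-0.5 * ((states - s_query) / h) ** 2)
        return (w @ values) / max(w.sum(), 1e-12)

    rng = np.random.default_rng(2)
    states = rng.uniform(-2.0, 2.0, size=200)                   # visited post-decision states
    values = -(states ** 2) + 0.1 * rng.standard_normal(200)    # noisy sampled returns
    print(kernel_value(0.5, states, values))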
Point-Based Policy Iteration
We describe a point-based policy iteration (PBPI) algorithm for infinite-horizon POMDPs. PBPI replaces the exact policy improvement step of Hansen’s policy iteration with point-based value iteration (PBVI). Despite being an approximate algorithm, PBPI is monotonic: At each iteration before convergence, PBPI produces a policy for which the values increase for at least one of a finite set of init...
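The point-based backup that PBVI contributes to the improvement step can be sketched as follows on a hypothetical two-state, two-action, two-observation POMDP. This is only the backup over a fixed belief set, not the full PBPI loop; T, O, R, the belief points, and the iteration count are illustrative.

    import numpy as np

    gamma = 0.95
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],    # T[a, s, s']
                  [[0.5, 0.5], [0.5, 0.5]]])
    O = np.array([[[0.8, 0.2], [0.3, 0.7]],    # O[a, s', o]
                  [[0.5, 0.5], [0.5, 0.5]]])
    R = np.array([[1.0, 0.0],                  # R[a, s]
                  [0.2, 0.2]])

    def point_backup(b, Gamma):
        # one point-based backup at belief b given the current alpha-vector set Gamma
        best_val, best_alpha = -np.inf, None
        for a in range(T.shape[0]):
            g_a = R[a].copy()
            for o in range(O.shape[2]):
                # projected vectors g_{a,o}(s) = sum_{s'} T(s,a,s') O(s',a,o) alpha(s')
                proj = [T[a] @ (O[a][:, o] * alpha) for alpha in Gamma]
                g_a = g_a + gamma * max(proj, key=lambda g: b @ g)
            if b @ g_a > best_val:
                best_val, best_alpha = b @ g_a, g_a
        return best_alpha

    Gamma = [np.zeros(2)]
    beliefs = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]
    for _ in range(50):
        Gamma = [point_backup(b, Gamma) for b in beliefs]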
Convergence Results for Some Temporal Difference Methods Based on Least Squares
We consider finite-state Markov decision processes, and prove convergence and rate of convergence results for certain least squares policy evaluation algorithms of the type known as LSPE(λ). These are temporal difference methods for constructing a linear function approximation of the cost function of a stationary policy, within the context of infinite-horizon discounted and average cost dynamic...
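For the λ = 0 case, the simulation-based iteration has roughly the shape sketched below for a hypothetical two-state chain under a fixed policy: running sums estimate the matrices of the projected Bellman equation, and the weights are moved by solving with the current estimates. The transition matrix, costs, basis, unit stepsize, and the small regularization of B are illustrative assumptions rather than the paper's exact scheme.

    import numpy as np

    gamma = 0.9
    P = np.array([[0.7, 0.3],                 # transition matrix of the fixed policy
                  [0.4, 0.6]])
    g = np.array([1.0, 2.0])                  # one-stage costs
    Phi = np.array([[1.0, 0.0],               # rows are feature vectors phi(x)
                    [1.0, 1.0]])

    rng = np.random.default_rng(3)
    theta = np.zeros(2)
    B = 1e-3 * np.eye(2)                      # small regularization keeps B invertible early on
    A = np.zeros((2, 2))
    b = np.zeros(2)

    x = 0
    for t in range(5000):
        x_next = rng.choice(2, p=P[x])
        phi, phi_next = Phi[x], Phi[x_next]
        # simulation-based sums defining the LSPE(0) iteration
        B += np.outer(phi, phi)
        A += np.outer(phi, gamma * phi_next - phi)
        b += phi * g[x]
        theta = theta + np.linalg.solve(B, A @ theta + b)
        x = x_next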